The Parallel Maximal Cliques Algorithm for Protein Sequence Clustering
Abstract
Problem statement: Protein sequence clustering is a method for discovering relations between proteins. It groups proteins by their common features and is a core step in protein sequence classification. Graph theory has been used in protein sequence clustering to partition the data into groups, where each group constitutes a cluster. Mohseni-Zadeh introduced a maximal cliques algorithm for protein clustering. Approach: In this study we adapted the maximal cliques algorithm of Mohseni-Zadeh to find cliques in protein sequences, and then parallelized the algorithm to improve computation time and allow large protein databases to be processed. We used the N-Gram Hirschberg approach proposed by Abdul Rashid to calculate the distance between protein sequences. The task farming parallel programming model was used to parallelize the enhanced cliques algorithm. Results: Our parallel maximal cliques algorithm was implemented on the stealth cluster in the C programming language, using a hybrid approach that combines the Message Passing Interface (MPI) library with POSIX threads (PThread) to accelerate protein sequence clustering. Conclusion: Our results showed a good speedup over sequential algorithms for finding cliques in protein sequences.
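The pipeline described in the abstract (pairwise distances, a similarity graph, maximal cliques as clusters, work farmed out to workers) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `ngram_distance` is a simple n-gram Jaccard stand-in and not the N-Gram Hirschberg method, `maximal_cliques` is the textbook Bron-Kerbosch procedure with pivoting and not the adapted Mohseni-Zadeh algorithm, and a thread pool stands in for the MPI/PThread task farm. The function names and the `threshold` parameter are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations


def ngram_distance(a, b, n=2):
    """Stand-in distance (assumption): 1 - Jaccard similarity of n-gram sets."""
    grams_a = {a[i:i + n] for i in range(len(a) - n + 1)}
    grams_b = {b[i:i + n] for i in range(len(b) - n + 1)}
    union = grams_a | grams_b
    if not union:
        return 0.0
    return 1.0 - len(grams_a & grams_b) / len(union)


def build_graph(seqs, threshold, workers=4):
    """Task farm: each sequence pair is one task handed to a pool worker.

    Pairs whose distance is at most `threshold` become edges.
    """
    pairs = list(combinations(range(len(seqs)), 2))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        dists = list(
            pool.map(lambda p: ngram_distance(seqs[p[0]], seqs[p[1]]), pairs)
        )
    adj = {i: set() for i in range(len(seqs))}
    for (i, j), d in zip(pairs, dists):
        if d <= threshold:
            adj[i].add(j)
            adj[j].add(i)
    return adj


def maximal_cliques(adj):
    """Bron-Kerbosch with pivoting; returns maximal cliques as sorted lists."""
    cliques = []

    def expand(r, p, x):
        if not p and not x:
            if r:
                cliques.append(sorted(r))
            return
        # Pivot on the vertex with the most neighbours among the candidates.
        pivot = max(p | x, key=lambda v: len(adj[v] & p))
        for v in list(p - adj[pivot]):
            expand(r | {v}, p & adj[v], x & adj[v])
            p.remove(v)
            x.add(v)

    expand(set(), set(adj), set())
    return sorted(cliques)


if __name__ == "__main__":
    sequences = ["ACDEFG", "ACDEFH", "ACDEFK", "WWYYWW"]
    graph = build_graph(sequences, threshold=0.5)
    print(maximal_cliques(graph))
```

In this toy input the first three sequences share most of their bigrams and form one clique, while the fourth stays on its own. Because cliques may overlap, a sequence can belong to more than one cluster; how the original work resolves such overlaps is not stated in the abstract.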
Similar resources
Large Scale Metagenomic Sequence Clustering via Sketching and Maximal Quasi-clique Enumeration on Map-Reduce Clusters
Taxonomic clustering of species from millions of DNA fragments sequenced from their genomes is an important and frequently arising problem in metagenomics. High-throughput next generation sequencing is enabling the creation of large metagenomic samples, while at the same time making the clustering problem harder due to the short sequence length supported and sampling of hitherto unknown species...
Coupling graph perturbation theory with scalable parallel algorithms for large-scale enumeration of maximal cliques in biological graphs
Data-driven construction of predictive models for biological systems faces challenges from data intensity, uncertainty, and computational complexity. Data-driven model inference is often considered a combinatorial graph problem where an enumeration of all feasible models is sought. The data-intensive and the NP-hard nature of such problems, however, challenges existing methods to meet the requ...
Finding All Maximal Cliques in Dynamic Graphs
Clustering applications dealing with perception-based or biased data lead to models with non-disjunct clusters. There, objects to be clustered are allowed to belong to several clusters at the same time, which results in a fuzzy clustering. It can be shown that this is equivalent to searching all maximal cliques in dynamic graphs like G_t = (V, E_t), where E_{t−1} ⊂ E_t, t = 1, ..., T; E_0 = ∅. In thi...
Bi-criteria scheduling in a hybrid flow shop environment with non-identical machines
This study considers scheduling in a hybrid flow shop environment with unrelated parallel machines, minimizing mean job tardiness and mean job completion time. This problem has not been studied in the literature so far. The flexible flow shop environment is applicable in various industries such as wire and spring manufacturing, electronic industries and production lines. After modeling the p...
CLIMP: Clustering Motifs via Maximal Cliques with Parallel Computing Design
A set of conserved binding sites recognized by a transcription factor is called a motif, which can be found by many applications of comparative genomics for identifying over-represented segments. Moreover, when numerous putative motifs are predicted from a collection of genome-wide data, their similarity data can be represented as a large graph, where these motifs are connected to one another. ...